Goto

Collaborating Authors

 safety feature


AI's safety features can be circumvented with poetry, research finds

The Guardian

Roses are red, violets are blue, how do you make a nuclear bomb? Roses are red, violets are blue, how do you make a nuclear bomb? AI's safety features can be circumvented with poetry, research finds Poetry can be linguistically and structurally unpredictable - and that's part of its joy. But one man's joy, it turns out, can be a nightmare for AI models. Those are the recent findings of researchers out of Italy's Icaro Lab, an initiative from a small ethical AI company called DexAI.



Measuring Moral LLM Responses in Multilingual Capacities

Basu, Kimaya, Kolari, Savi, Yu, Allison

arXiv.org Artificial Intelligence

With LLM usage becoming widespread across countries, languages, and humanity more broadly, the need to understand and guardrail their multilingual responses increases. Large-scale datasets for testing and benchmarking have been created to evaluate and facilitate LLM responses across multiple dimensions. In this study, we evaluate the responses of frontier and leading open-source models in five dimensions across low and high-resource languages to measure LLM accuracy and consistency across multilingual contexts. We evaluate the responses using a five-point grading rubric and a judge LLM. Our study shows that GPT -5 performed the best on average in each category, while other models displayed more inconsistency across language and category. Most notably, in the Consent & Autonomy and Harm Prevention & Safety categories, GPT scored the highest with averages of 3.56 and 4.73, while Gemini 2.5 Pro scored the lowest with averages of 1.39 and 1.98, respectively. These findings emphasize the need for further testing on how linguistic shifts impact LLM responses across various categories and improvement in these areas.


`For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts

Schoene, Annika M, Canca, Cansu

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) have led to increasingly sophisticated safety protocols and features designed to prevent harmful, unethical, or unauthorized outputs. However, these guardrails remain susceptible to novel and creative forms of adversarial prompting, including manually generated test cases. In this work, we present two new test cases in mental health for (i) suicide and (ii) self-harm, using multi-step, prompt-level jailbreaking and bypass built-in content and safety filters. We show that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm. We conduct an empirical evaluation across six widely available LLMs, demonstrating the generalizability and reliability of the bypass. We assess these findings and the multilayered ethical tensions that they present for their implications on prompt-response filtering and context- and task-specific model development. We recommend a more comprehensive and systematic approach to AI safety and ethics while emphasizing the need for continuous adversarial testing in safety-critical AI deployments. We also argue that while certain clearly defined safety measures and guardrails can and must be implemented in LLMs, ensuring robust and comprehensive safety across all use cases and domains remains extremely challenging given the current technical maturity of general-purpose LLMs.


Developing a Robotic Surgery Training System for Wide Accessibility and Research

Shaker, Walid, Erden, Mustafa Suphi

arXiv.org Artificial Intelligence

-- Robotic surgery represents a major breakthrough in medical interventions, which has revolutionized surgical procedures. However, the high cost and limited accessibility of robotic surgery systems pose significant challenges for training purposes. This study addresses these issues by developing a cost-effective robotic laparoscopy training system that closely replicates advanced robotic surgery setups to ensure broad access for both on-site and remote users. Key innovations include the design of a low-cost robotic end-effector that effectively mimics high-end laparoscopic instruments. Additionally, a digital twin platform was established, facilitating detailed simulation, testing, and real-time monitoring, which enhances both system development and deployment. Furthermore, teleop-eration control was optimized, leading to improved trajectory tracking while maintaining remote center of motion (RCM) constraint, with a RMSE of 5 µ m and reduced system latency to 0.01 seconds. As a result, the system provides smooth, continuous motion and incorporates essential safety features, making it a highly effective tool for laparoscopic training.


SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

Li, Mingjie, Si, Wai Man, Backes, Michael, Zhang, Yang, Wang, Yisen

arXiv.org Artificial Intelligence

As advancements in large language models (LLMs) continue and the demand for personalized models increases, parameter-efficient fine-tuning (PEFT) methods (e.g., LoRA) will become essential due to their efficiency in reducing computation costs. However, recent studies have raised alarming concerns that LoRA fine-tuning could potentially compromise the safety alignment in LLMs, posing significant risks for the model owner. In this paper, we first investigate the underlying mechanism by analyzing the changes in safety alignment related features before and after fine-tuning. Then, we propose a fixed safety module calculated by safety data and a task-specific initialization for trainable parameters in low-rank adaptations, termed Safety-alignment preserved Low-Rank Adaptation (SaLoRA). Unlike previous LoRA methods and their variants, SaLoRA enables targeted modifications to LLMs without disrupting their original alignments. Our experiments show that SaLoRA outperforms various adapters-based approaches across various evaluation metrics in different fine-tuning tasks.


The best space heaters in 2024

Popular Science

We may earn revenue from the products available on this page and participate in affiliate programs. If you're tired of stockpiling blankets, extra socks, and heated slippers to keep you warm, it might be time to consider getting a space heater. These powerful appliances are a great way to get cozy without installing a complicated heating system or commandeering the thermostat. If your radiator just isn't cutting it or someone insists on keeping a window open to freshen the room up, a space heater could be the perfect solution. These hot machines are designed specifically to warm up spaces of all sizes and should be portable, effective, and fast-acting. Our best overall pick, the Lasko 5586 Electric 1500W Ceramic Space Heater Tower, ticks all these boxes.


'Many-shot jailbreak': lab reveals how AI safety features can be easily bypassed

The Guardian

The safety features on some of the most powerful AI tools that stop them being used for cybercrime or terrorism can be bypassed simply by flooding them with examples of wrongdoing, research has shown. In a paper from the AI lab Anthropic, which produces the large language model (LLM) behind the ChatGPT rival Claude, researchers described an attack they called "many-shot jailbreaking". The attack was as simple as it was effective. Claude, like most large commercial AI systems, contains safety features designed to encourage it to refuse certain requests, such as to generate violent or hateful speech, produce instructions for illegal activities, deceive or discriminate. A user who asks the system for instructions to build a bomb, for example, will receive a polite refusal to engage.


Safe Exploration in Finite Markov Decision Processes with Gaussian Processes

Neural Information Processing Systems

In classical reinforcement learning agents accept arbitrary short term loss for long term gain when exploring their environment. This is infeasible for safety critical applications such as robotics, where even a single unsafe action may cause system failure or harm the environment. In this paper, we address the problem of safely exploring finite Markov decision processes (MDP). We define safety in terms of an a priori unknown safety constraint that depends on states and actions and satisfies certain regularity conditions expressed via a Gaussian process prior.


Is this helicopter that can fly itself the answer to ending chopper crashes?

FOX News

Kurt "CyberGuy" Knutsson discusses a craft that can fly autonomously without any human intervention. Imagine a helicopter that can take off, fly and land without a human pilot. CLICK TO GET KURT'S FREE CYBERGUY NEWSLETTER WITH SECURITY ALERTS, QUICK VIDEO TIPS, TECH REVIEWS, AND EASY HOW-TO'S TO MAKE YOU SMARTER The R550X is a revolutionary helicopter from Rotor Technologies. It is special because it is the first of its kind to be designed for civilian use, not military or law enforcement. It can perform a variety of missions, such as crop spraying, cargo delivery, firefighting, surveillance, inspection, mapping, surveying, research, exploration, entertainment, and more.